CUT&Tag 数据处理与分析教程三：BAM 文件统计（CUT&Tag 不建议去重不去重不去重）

工具人~ 狮山生信 2022-08-15

CUT&Tag 数据处理与分析教程二：质控（不需要修剪 reads！不需要修剪 reads！不需要修剪reads！）和数据比对

本章放在前面最重要的内容，你可以不继续看，但是这段话你必须好好思考：

PS: 以往在 ChIP-seq 数据分析中去重复步骤大部分都是要做的，在这里作者是建议最好不去重~ （我觉得这种东西按照原作者来就好~ 哈哈，我有文章引用，不知道大家怎么看）。

“
CUT&Tag 将引物连接到 antibody-tethered pA-Tn5 附近的 DNA 中，而整合位点的可信度受到周围 DNA 可及性的影响。因此，具有 相同起始和结束位置的片段是常见的，而这种重复可能不是由于 PCR 过程中的重复造成的。在实践中，我们发现对于高质量的 CUT&Tag 数据集，表面重复率很低，甚至表面重复的片段很可能是真实的片段。因此，我们不建议删除重复 reads。在使用非常少量的材料进行的实验中，或者在怀疑存在 PCR 重复的地方，重复的部分可以被移除 。

在之前我们已经进行将数据回比到基因组上了，那么我们首先肯定是先知道我到底有多少 reads 能够回比到基因组上（也就是说有多少 reads 是来源我们的物种中）。我们可以通过 bowtie2 的比对日志文件 ${projPath}/alignment/sam/bowtie2)summary/${histName}_bowtie2.txt 中可以看到详细的比对信息，并且我们所见到的内容应该和以下内容格式一致：

2984630 reads; of these:
  2984630 (100.00%) were paired; of these:
    125110 (4.19%) aligned concordantly 0 times
    2360430 (79.09%) aligned concordantly exactly 1 time
    499090 (16.72%) aligned concordantly >1 times
95.81% overall alignment rate

其中每一行所代表的具体意思为：

2984640 为测序深度，比如：双端 reads 总的数目
125110 为双端数据中不能回帖到基因组的 reads 数目
2360430 + 499090 为双端数据中能回帖到基因组的 reads 数目
95.81% 表示总体的比对率

从此结果报告中我们可以看到大多数 reads 都是能够回比到基因组中，也同时表明其中受到其他因素的污染程度较少。

测序比对报告汇总（需要）：

汇总原始 reads 和唯一比对 reads，以得到有效比对率。对于高质量数据比对效率要 >80%。CUT&Tag 数据通常具有 非常低的背景噪音，所以只需要 100 万条被回帖的片段就可以为人类基因组中的组蛋白修饰提供可靠的资料。对含量较低的转录因子和染色质蛋白的分析，下游分析可能需要 10 倍的测序片段。我们可以评估以下指标:

测序深度：Sequencing depth
比对率：Alignment rate
比对上的片段数目：Number of mappable fragments
重复率：Duplication rate
唯一比对的文库大小：Unique library size
片段大小分布：Fragment size distribution

计算测序深度

那么我们怎么来计算我们所测得数据的测序深度呢？

##=== R 命令 ===## 
## Path to the project and histone list
projPath = "/fh/fast/gottardo_r/yezheng_working/cuttag/CUTTag_tutorial"
sampleList = c("K27me3_rep1", "K27me3_rep2", "K4me3_rep1", "K4me3_rep2", "IgG_rep1", "IgG_rep2")
histList = c("K27me3", "K4me3", "IgG")
## Collect the alignment results from the bowtie2 alignment summary files
alignResult = c()
for(hist in sampleList){
  alignRes = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2.txt"), header = FALSE, fill = TRUE)
  alignRate = substr(alignRes$V1[6], 1, nchar(as.character(alignRes$V1[6]))-1)
  histInfo = strsplit(hist, "_")[[1]]
  alignResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], 
                           SequencingDepth = alignRes$V1[1] %>% as.character %>% as.numeric, 
                           MappedFragNum_hg38 = alignRes$V1[4] %>% as.character %>% as.numeric + alignRes$V1[5] %>% as.character %>% as.numeric, 
                           AlignmentRate_hg38 = alignRate %>% as.numeric)  %>% rbind(alignResult, .)
}
alignResult$Histone = factor(alignResult$Histone, levels = histList)
alignResult %>% mutate(AlignmentRate_hg38 = paste0(AlignmentRate_hg38, "%"))

Spike-in 比对信息

##=== R command ===## 
spikeAlign = c()
for(hist in sampleList){
  spikeRes = read.table(paste0(projPath, "/alignment/sam/bowtie2_summary/", hist, "_bowtie2_spikeIn.txt"), header = FALSE, fill = TRUE)
  alignRate = substr(spikeRes$V1[6], 1, nchar(as.character(spikeRes$V1[6]))-1)
  histInfo = strsplit(hist, "_")[[1]]
  spikeAlign = data.frame(Histone = histInfo[1], Replicate = histInfo[2], 
                          SequencingDepth = spikeRes$V1[1] %>% as.character %>% as.numeric, 
                          MappedFragNum_spikeIn = spikeRes$V1[4] %>% as.character %>% as.numeric + spikeRes$V1[5] %>% as.character %>% as.numeric, 
                          AlignmentRate_spikeIn = alignRate %>% as.numeric)  %>% rbind(spikeAlign, .)
}
spikeAlign$Histone = factor(spikeAlign$Histone, levels = histList)
spikeAlign %>% mutate(AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))

汇总回帖到 hg38 和大肠杆菌 E.coli 信息

##=== R command ===## 
alignSummary = left_join(alignResult, spikeAlign, by = c("Histone", "Replicate", "SequencingDepth")) %>%
  mutate(AlignmentRate_hg38 = paste0(AlignmentRate_hg38, "%"), 
         AlignmentRate_spikeIn = paste0(AlignmentRate_spikeIn, "%"))
alignSummary

测序深度和回帖结果可视化

##=== R 命令 ===## 
## Generate sequencing depth boxplot
fig3A = alignResult %>% ggplot(aes(x = Histone, y = SequencingDepth/1000000, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Sequencing Depth per Million") +
    xlab("") + 
    ggtitle("A. Sequencing Depth")
fig3B = alignResult %>% ggplot(aes(x = Histone, y = MappedFragNum_hg38/1000000, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Mapped Fragments per Million") +
    xlab("") +
    ggtitle("B. Alignable Fragment (hg38)")
fig3C = alignResult %>% ggplot(aes(x = Histone, y = AlignmentRate_hg38, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("% of Mapped Fragments") +
    xlab("") +
    ggtitle("C. Alignment Rate (hg38)")
fig3D = spikeAlign %>% ggplot(aes(x = Histone, y = AlignmentRate_spikeIn, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Spike-in Alignment Rate") +
    xlab("") +
    ggtitle("D. Alignment Rate (E.coli)")
ggarrange(fig3A, fig3B, fig3C, fig3D, ncol = 2, nrow=2, common.legend = TRUE, legend="bottom")

“
在一个典型的 CUT&Tag 实验中，针对 65000 个 K562 细胞中富集的 H3K27me3 组蛋白修饰，大肠杆菌的比对率在 0.01% 到 10% 之间。在细胞较少或表位较少的情况下，大肠杆菌的 reads 可占总回帖 reads 的 70%。对于 IgG 对照组，大肠杆菌 reads 的百分比通常比富集的组蛋白修饰高得多。

去除重复 reads（可选 / 需要）

CUT&Tag 将引物连接到 antibody-tethered pA-Tn5 附近的 DNA 中，而整合位点的可信度受到周围 DNA 可及性的影响。因此，具有 相同起始和结束位置的片段是常见的，而这种重复可能不是由于 PCR 过程中的重复造成的。在实践中，我们发现对于高质量的 CUT&Tag 数据集，表面重复率很低，甚至表面重复的片段很可能是真实的片段。因此，我们不建议删除重复 reads。在使用 非常少量的材料进行的实验中，或者在怀疑存在 PCR 重复的地方，重复的部分可以被移除 。下面的命令显示如何使用 Picard 检查重复率。

##== linux 命令 ==##
## depending on how you load picard and your server environment, the picardCMD can be different. Adjust accordingly.
picardCMD="java -jar picard.jar"
mkdir -p $projPath/alignment/rmDuplicate/picard_summary
## Sort by coordinate
## 按照坐标排序
$picardCMD SortSam \
I=$projPath/alignment/sam/${histName}_bowtie2.sam \
O=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam \
SORT_ORDER=coordinate
## mark duplicates
## 标记重复
$picardCMD MarkDuplicates \
I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam \
O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.dupMarked.sam \
METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.dupMark.txt
## remove duplicates
## 去除重复 reads 
picardCMD \
MarkDuplicates I=$projPath/alignment/sam/${histName}_bowtie2.sorted.sam \
O=$projPath/alignment/removeDuplicate/${histName}_bowtie2.sorted.rmDup.sam \
REMOVE_DUPLICATES=true \
METRICS_FILE=$projPath/alignment/removeDuplicate/picard_summary/${histName}_picard.rmDup.txt

我们汇总了表面上的重复率，并计算了不重复的唯一文库的大小。

##=== R 命令 ===## 
## Summarize the duplication information from the picard summary outputs.
dupResult = c()
for(hist in sampleList){
  dupRes = read.table(paste0(projPath, "/alignment/rmDuplicate/picard_summary/", hist, "_picard.rmDup.txt"), header = TRUE, fill = TRUE)
  
  histInfo = strsplit(hist, "_")[[1]]
  dupResult = data.frame(Histone = histInfo[1], Replicate = histInfo[2], MappedFragNum_hg38 = dupRes$READ_PAIRS_EXAMINED[1] %>% 
  as.character %>% as.numeric, DuplicationRate = dupRes$PERCENT_DUPLICATION[1] %>% 
  as.character %>% as.numeric * 100, EstimatedLibrarySize = dupRes$ESTIMATED_LIBRARY_SIZE[1] %>% 
  as.character %>% 
  as.numeric) %>% 
  mutate(UniqueFragNum = MappedFragNum_hg38 * (1-DuplicationRate/100))  %>% 
  rbind(dupResult, .)
}
dupResult$Histone = factor(dupResult$Histone, levels = histList)
alignDupSummary = left_join(alignSummary, dupResult, by = c("Histone", "Replicate", "MappedFragNum_hg38")) %>% 
mutate(DuplicationRate = paste0(DuplicationRate, "%"))
alignDupSummary

这些示例数据集中，IgG 对照样本具有相对较高的重复率，因为该样本中的 reads 来自于 CUT&Tag 反应中的非特异性标记。因此，在进行下游分析之前，从 IgG 数据集中删除重复数据是合适的。
估计库大小是 Picard 计算的基于双毒数据重复的文库中唯一 reads 的估计数。
估计的文库大小与目标表位的丰度和所用抗体的质量成正比，而 IgG 样本的估计文库大小预计很低。
唯一片段数由 MappedFragNum_hg38 * (1-DuplicationRate/100) 计算。


##=== R i命令 ===## 
## generate boxplot figure for the  duplication rate
fig4A = dupResult %>% ggplot(aes(x = Histone, y = DuplicationRate, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Duplication Rate (*100%)") +
    xlab("") 
fig4B = dupResult %>% ggplot(aes(x = Histone, y = EstimatedLibrarySize, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("Estimated Library Size") +
    xlab("") 
fig4C = dupResult %>% ggplot(aes(x = Histone, y = UniqueFragNum, fill = Histone)) +
    geom_boxplot() +
    geom_jitter(aes(color = Replicate), position = position_jitter(0.15)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 18) +
    ylab("# of Unique Fragments") +
    xlab("")
ggarrange(fig4A, fig4B, fig4C, ncol = 3, common.legend = TRUE, legend="bottom")

评估回帖片段大小分布

CUT&Tag 插入接头在连接酶附近的染色质的任何一边，尽管标签 tagmentation 在染色质也可以发生。因此，针对组蛋白修饰的 CUT&Tag 反应主要导致的片段是核染色体长度 (~180 bp)，或者是 该长度的倍数。 CUT&Tag 靶向转录因子 主要分别从邻近的核小体和因子结合位点产生核小体大小的片段和数量不等的短片段。核小体表面的 DNA 标记也会发生，用单碱基分辨率绘制片段长度显示了 10-bp 的锯齿周期性，这是 成功的 CUT&Tag 实验的典型特征。

##== linux 命令 ==##
mkdir -p $projPath/alignment/sam/fragmentLen
## Extract the 9th column from the alignment sam file which is the fragment length
samtools view -F 0x04 $projPath/alignment/sam/${histName}_bowtie2.sam |\
awk -F'\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {print abs($9)}' |\
sort |\
uniq -c |\
awk -v OFS="\t" '{print $2, $1/2}' >$projPath/alignment/sam/fragmentLen/${histName}_fragmentLen.txt

##=== R 命令 ===## 
## Collect the fragment size information
fragLen = c()
for(hist in sampleList){
  
  histInfo = strsplit(hist, "_")[[1]]
  fragLen = read.table(paste0(projPath, "/alignment/sam/fragmentLen/", hist, "_fragmentLen.txt"), header = FALSE) %>% mutate(fragLen = V1 %>% as.numeric, fragCount = V2 %>% as.numeric, Weight = as.numeric(V2)/sum(as.numeric(V2)), Histone = histInfo[1], Replicate = histInfo[2], sampleInfo = hist) %>% rbind(fragLen, .) 
}
fragLen$sampleInfo = factor(fragLen$sampleInfo, levels = sampleList)
fragLen$Histone = factor(fragLen$Histone, levels = histList)
## Generate the fragment size density plot (violin plot)
fig5A = fragLen %>% ggplot(aes(x = sampleInfo, y = fragLen, weight = Weight, fill = Histone)) +
    geom_violin(bw = 5) +
    scale_y_continuous(breaks = seq(0, 800, 50)) +
    scale_fill_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma", alpha = 0.8) +
    scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9) +
    theme_bw(base_size = 20) +
    ggpubr::rotate_x_text(angle = 20) +
    ylab("Fragment Length") +
    xlab("")
fig5B = fragLen %>% ggplot(aes(x = fragLen, y = fragCount, color = Histone, group = sampleInfo, linetype = Replicate)) +
  geom_line(size = 1) +
  scale_color_viridis(discrete = TRUE, begin = 0.1, end = 0.9, option = "magma") +
  theme_bw(base_size = 20) +
  xlab("Fragment Length") +
  ylab("Count") +
  coord_cartesian(xlim = c(0, 500))
ggarrange(fig5A, fig5B, ncol = 2)

“
较小的片段 (50-100 bp) 可能是由于 Tn5 可以固定在核小体的表面和连接区域，所以这些小片段可能不是背景。

评估重复性

重复样本之间的数据重复性是通过对整个基因组的回帖 reads 数进行相关分析来评估的。为了实现的简单性，我们将在第四节之后将该文件格式转换为片段 bed 文件进行分析。

古树普洱茶山头直发，买茶别再花冤枉钱！

19岁中国留学生投票被抓，“假装”公民身份！且已无法撤回.........

中国留学生在美国非法投票，后果很严重

19岁中国留学生非法投票美国大选，被控2罪！或被判15年监禁

恶魔医生刘翔峰判了，湘雅二院改好了吗？

CUT&Tag 数据处理与分析教程三：BAM 文件统计（CUT&Tag 不建议去重不去重不去重）

测序比对报告汇总（需要）：

计算测序深度

Spike-in 比对信息

汇总回帖到 hg38 和大肠杆菌 E.coli 信息

测序深度和回帖结果可视化

去除重复 reads（可选 / 需要）

评估回帖片段大小分布

评估重复性

您可能也对以下帖子感兴趣

古树普洱茶山头直发，买茶别再花冤枉钱！

19岁中国留学生投票被抓，“假装”公民身份！且已无法撤回.........

中国留学生在美国非法投票，后果很严重

19岁中国留学生非法投票美国大选，被控2罪！或被判15年监禁

恶魔医生刘翔峰判了，湘雅二院改好了吗？

生成图片，分享到微信朋友圈

CUT&Tag 数据处理与分析教程 三：BAM 文件统计（CUT&Tag 不建议去重不去重不去重）

测序比对报告汇总（需要）：

计算测序深度

Spike-in 比对信息

汇总回帖到 hg38 和大肠杆菌 E.coli 信息

测序深度和回帖结果可视化

去除重复 reads（可选 / 需要）

评估回帖片段大小分布

评估重复性

您可能也对以下帖子感兴趣

CUT&Tag 数据处理与分析教程三：BAM 文件统计（CUT&Tag 不建议去重不去重不去重）